Partial coherence enhances parallelized photonic computing

Advancements in optical coherence control1–5 have unlocked many cutting-edge applications, including long-haul communication, light detection and ranging (LiDAR) and optical coherence tomography6–8. Prevailing wisdom suggests that using more coherent light sources leads to enhanced system performance and device functionalities9–11. Our study introduces a photonic convolutional processing system that takes advantage of partially coherent light to boost computing parallelism without substantially sacrificing accuracy, potentially enabling larger-size photonic tensor cores. The reduction of the degree of coherence optimizes bandwidth use in the photonic convolutional processing system. This breakthrough challenges the traditional belief that coherence is essential or even advantageous in integrated photonic accelerators, thereby enabling the use of light sources with less rigorous feedback control and thermal-management requirements for high-throughput photonic computing. Here we demonstrate such a system in two photonic platforms for computing applications: a photonic tensor core using phase-change-material photonic memories that delivers parallel convolution operations to classify the gaits of ten patients with Parkinson’s disease with 92.2% accuracy (92.7% theoretically) and a silicon photonic tensor core with embedded electro-absorption modulators (EAMs) to facilitate 0.108 tera operations per second (TOPS) convolutional processing for classifying the Modified National Institute of Standards and Technology (MNIST) handwritten digits dataset with 92.4% accuracy (95.0% theoretically).


Directional coupler design for the same output contribution from different unit cells
The power splitters are equipped with active thermo-optic phase shifters to tune light routing across the photonic memory crossbar array. By supplying the proper voltage to the phase shifter, light can be completely routed to the next cell in a row for weight setting, or routed to the common bus waveguide in a row for accumulation (Figure S5b). The tunable power splitters of the photonic memory crossbar array were controlled by a digital signal processor (DSP, Analog Devices DC2026) to ensure that all pump power was concentrated into the PCM of the target cell.
For example, to set  32 in Fig. 4a,  1 was used, VOA3 was on while VOA1 and VOA2 were off, so that the pump light was routed to Ch 3. Cell 31 was controlled to distribute all light into the top channel of its 2 × 2 MMI, and Cell 32 was controlled to distribute all light into the MMI bottom channel to efficiently set  32 . In this case, Cell 33 was idle.
Directional couplers (DCs) are also carefully designed to ensure that outputs from different unit cells have the same contribution. Since symmetric DCs are used to route weighted outputs from each cell into buses, the optical power in the buses will partially couple back into the cells. The coupling ratio of the DCs in row n is therefore designed to be 1/n. Consequently, the optical power received at the output waveguide of column m from row n is balanced across all cells, apart from the differing weights.
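The balancing effect of the 1/n coupling ratio can be checked numerically. The sketch below is illustrative only and rests on assumptions not stated in the text: couplers are lossless, row 1 is the row farthest from the detector, and light already in the bus keeps a fraction 1 − 1/k when it passes the coupler of row k. The function name is hypothetical.

```python
from fractions import Fraction


def bus_contribution(n: int, num_rows: int) -> Fraction:
    """Fraction of the weighted output of row n that reaches the end of the bus.

    Assumes (hypothetically) lossless couplers with coupling ratio 1/k in row k,
    and that the bus passes the couplers of rows n+1 ... num_rows after row n.
    """
    coupled_in = Fraction(1, n)          # share of the cell output entering the bus
    surviving = Fraction(1)
    for k in range(n + 1, num_rows + 1):
        surviving *= 1 - Fraction(1, k)  # share left in the bus after row k
    return coupled_in * surviving


if __name__ == "__main__":
    rows = 9
    for n in range(1, rows + 1):
        # Every row is expected to contribute the same fraction, 1/rows.
        print(n, bus_contribution(n, rows))
```

Under these assumptions the product telescopes, so each row delivers the same fraction 1/N of its weighted output, which is the equal-contribution property described above.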
-End of supplementary text 1

Mapping transmission levels to negative weights
In our partially coherent photonic computing approach, the transmission levels of phase change material memory T are mapped to weights w in [- . This mapping approach can be implemented in hardware using the balanced detection scheme.
As shown in Figure S7a, for every column implementing a dot-product operation, we can add a reference column that stores all the reference transmission levels. The balanced photodetector will generate the mapped weight values, which are allowed to be negative. The drawback of this scheme is the loss of half of the optical power.
Alternatively, instead of using a reference column for every dot-product operation, we can use only one reference column (Figure S7b) and do the subtraction in software. This approach only leads to the loss of 1/(M+1) of the optical power, where M is the number of columns in the photonic tensor core. The running time of the subtraction is thus $O(M)$, which is small compared to the $O(M^2)$ MAC operations implemented by the photonic tensor core.
The need for such a balanced photodetection scheme is a consequence of the use of transmittance to represent weights. The transmittance, as a physical property, is non-negative.
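The single-reference-column scheme can be sketched in a few lines. This is a minimal illustration, assuming the signed weight is the difference between a cell's transmission and the reference transmission of its row (w = T − T_ref, with any scale factor omitted); the exact mapping is not given in the text, and the function and variable names are hypothetical.

```python
def signed_dot_products(x, T, T_ref):
    """Recover signed dot products from non-negative transmissions.

    x:     list of N input intensities
    T:     N x M nested list of transmissions (all non-negative)
    T_ref: list of N reference transmissions (the single reference column)

    Assumption (not from the source): the signed weight is w = T - T_ref.
    """
    num_rows = len(x)
    num_cols = len(T[0])
    # M photodetector readings, one per weight column (all non-negative).
    column_outputs = [
        sum(x[n] * T[n][m] for n in range(num_rows)) for m in range(num_cols)
    ]
    # One extra reading from the reference column.
    reference_output = sum(x[n] * T_ref[n] for n in range(num_rows))
    # O(M) subtractions carried out in software.
    return [y - reference_output for y in column_outputs]


if __name__ == "__main__":
    x = [1.0, 0.5]
    T = [[0.8, 0.2], [0.4, 0.6]]
    T_ref = [0.5, 0.5]
    print(signed_dot_products(x, T, T_ref))
```

The result matches a direct dot product with the signed weights T − T_ref, while the optical hardware only ever handles non-negative quantities.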
The third approach to the non-negative weights issue, instead of changing the hardware architecture, is to modify the neural network so that it adapts to the non-negative nature of the weights.
The all-non-negative neural network has been investigated in Ref 1 and shows accuracy comparable to unconstrained neural networks on MNIST datasets, indicating that an all-non-negative neural network is possible.
-End of supplementary text 3

Power efficiency estimation
The power of the partially coherent photonic computing system is attributed to 1) the light source, 2) modulators for data loading, 3) electronics for the data loading modulators, 4) weighting elements, 5) optical receivers, and 6) ADCs.
Assuming an N×M photonic tensor core operating at f GSa/s, the expected power consumption will be:
1) $P_{\mathrm{light}}$: Light source. We use 0.2 mW at each input channel. Considering a wall-plug efficiency of 3.1% for the integrated ASE source 2 , the electrical power is 6.45 mW per input channel. The overall power consumption of the light source will be: $P_{\mathrm{light}} = 6.45 \times N \times 10^{-3}$ W
2) $P_{\mathrm{mod}}$: Modulators for data loading. The energy consumption of each IMEC EAM we use is 13.8 fJ/bit 3 . The overall power consumption of the data loading modulators is: $P_{\mathrm{mod}} = 13.8 \times 10^{-15} \times f \times 10^{9} \times N = 13.8 \times f \times N \times 10^{-6}$ W
3) $P_{\mathrm{mod\text{-}electronics}}$: Electronics for the data loading modulators. These electronics contain the DAC, driver, and backends, which will consume 625 fJ/bit 4 . The overall power consumption of these electronics will be: $P_{\mathrm{mod\text{-}electronics}} = 625 \times 10^{-15} \times f \times 10^{9} \times N = 625 \times f \times N \times 10^{-6}$ W
4) $P_{\mathrm{weight}}$: Weighting elements. The use of phase-change material as non-volatile photonic memory will consume no power, as the weight matrix presents fixed kernels.
5) $P_{\mathrm{rec}}$: Receivers. A receiver containing a photodetector, TIA, and buffer will consume 170 fJ/bit 5 . The overall power consumption of the receivers will be: $P_{\mathrm{rec}} = 170 \times 10^{-15} \times f \times 10^{9} \times M = 170 \times f \times M \times 10^{-6}$ W
6) $P_{\mathrm{ADC}}$: ADCs. An ADC will consume 6.25 pJ/bit 6 . The overall power consumption of the ADCs will be: $P_{\mathrm{ADC}} = 6.25 \times 10^{-12} \times f \times 10^{9} \times M = 6.25 \times f \times M \times 10^{-3}$ W
The total throughput of the photonic tensor core will be: $\mathrm{Throughput} = 2 \times N \times M \times f \times 10^{9}$ OPS, where the factor of 2 takes into account the two operations in one multiply-accumulate operation.
The energy efficiency of the photonic tensor core is defined as: $\eta = \mathrm{Throughput} / (P_{\mathrm{light}} + P_{\mathrm{mod}} + P_{\mathrm{mod\text{-}electronics}} + P_{\mathrm{weight}} + P_{\mathrm{rec}} + P_{\mathrm{ADC}})$
Considering the presented 9×3 photonic tensor core working at a 2 GSa/s data loading rate, the energy efficiency will be: 0.108 TOPS / (0.0580 + 0.0002 + 0.0112 + 0.0010 + 0.0375) W ≈ 1 TOPS/W. This energy efficiency is comparable to the 1 TOPS/W of the latest Google TPUv4 7 .
An energy efficiency surpassing that of Google TPUv4 can be achieved at a larger tensor core size, because the throughput scales with $N^2$ while the energy consumption scales with N.
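The power budget above can be reproduced with a short script. This is a sketch under stated assumptions: the per-channel energy figures are those quoted in the text, the light source is taken as 6.45 mW of electrical power per input channel, and one receiver and one ADC are assumed per output column. These last two points are inferred from the 0.0580 W and 0.0375 W terms of the 9×3 example rather than stated explicitly. Function names are illustrative.

```python
def power_budget_w(n_inputs, m_outputs, f_gsa):
    """Per-component electrical power in watts for an N x M core at f GSa/s."""
    rate = f_gsa * 1e9  # samples per second
    return {
        "light": 6.45e-3 * n_inputs,                   # assumed 6.45 mW per input
        "mod": 13.8e-15 * rate * n_inputs,             # 13.8 fJ/bit EAMs
        "mod_electronics": 625e-15 * rate * n_inputs,  # DAC, driver, backends
        "weight": 0.0,                                 # non-volatile PCM weights
        "receivers": 170e-15 * rate * m_outputs,       # photodetector, TIA, buffer
        "adc": 6.25e-12 * rate * m_outputs,            # assumed one ADC per column
    }


def throughput_tops(n_inputs, m_outputs, f_gsa):
    """Throughput in TOPS; the factor 2 counts multiply and accumulate."""
    return 2 * n_inputs * m_outputs * f_gsa * 1e9 / 1e12


def efficiency_tops_per_w(n_inputs, m_outputs, f_gsa):
    """Energy efficiency as throughput divided by total power."""
    total_power = sum(power_budget_w(n_inputs, m_outputs, f_gsa).values())
    return throughput_tops(n_inputs, m_outputs, f_gsa) / total_power


if __name__ == "__main__":
    for name, watts in power_budget_w(9, 3, 2).items():
        print(f"{name:>16s}: {watts:.4f} W")
    print("throughput:", throughput_tops(9, 3, 2), "TOPS")
    print("efficiency:", round(efficiency_tops_per_w(9, 3, 2), 3), "TOPS/W")
    # A larger square core, to illustrate the N^2 versus N scaling argument.
    print("64 x 64 core:", round(efficiency_tops_per_w(64, 64, 2), 1), "TOPS/W")
```

Because the power terms grow linearly with N and M while the throughput grows with their product, the same per-channel figures give a higher efficiency for a larger core.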
A comparison of energy efficiency to other photonic computing systems is provided in Table.
-End of supplementary text 4

MNIST fashion products dataset results
The results are qualitatively similar to the MNIST handwritten digits dataset, wherein edges are effectively extracted amidst certain background noise (

Limitation of partially coherent approach and comparison with coherent approach
Leveraging partially coherent light in a photonic tensor core enables the distribution of light within the same optical window across the entire core by eliminating phase fluctuation, thereby effectively enhancing data processing parallelism. In contrast, utilizing coherent light necessitates distinct wavelengths for each input channel to circumvent interference. Given an N-input-channel photonic tensor core, an available total bandwidth of 100 nm (e.g., 1500 nm to 1600 nm), and a minimum 0.8-nm spacing between adjacent wavelength channels (aligned with the ITU grid for 50 GHz modulation), the coherent approach can offer a parallelism $P = \lfloor 100/(0.8\,N) \rfloor$, while the partially coherent approach provides $P = \lfloor 100/\mathrm{OB} \rfloor$, independent of the number of input channels N, where OB is the optical bandwidth of the partially coherent light in nm.
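The two parallelism expressions can be evaluated directly. The sketch below assumes that the result is rounded down to a whole number of parallel operations and uses exact rational arithmetic to avoid floating-point edge cases at the boundaries; the function names are illustrative.

```python
import math
from fractions import Fraction


def _exact(value):
    """Convert a decimal number such as 0.8 to an exact fraction."""
    return Fraction(str(value))


def coherent_parallelism(n_inputs, total_bw_nm=100, spacing_nm=0.8):
    """Parallelism when each input channel needs its own wavelength."""
    return math.floor(_exact(total_bw_nm) / _exact(spacing_nm) / n_inputs)


def partially_coherent_parallelism(optical_bw_nm, total_bw_nm=100):
    """Parallelism when all input channels share one optical window."""
    return math.floor(_exact(total_bw_nm) / _exact(optical_bw_nm))


if __name__ == "__main__":
    for n in (20, 42, 62, 63, 125, 126):
        print(f"coherent, N={n}: P={coherent_parallelism(n)}")
    for ob in (16, 8, 4):
        print(f"partially coherent, OB={ob} nm: P={partially_coherent_parallelism(ob)}")
```

With a 100 nm window and 0.8 nm spacing there are 125 wavelength channels in total, so the coherent parallelism falls as N grows and reaches zero for N > 125, whereas the partially coherent value depends only on OB.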
Meanwhile, the lower SNR inherent to filtered ASE partially coherent light should be considered. This SNR depends on both the optical bandwidth and the intensity received at the photodetector.
Under these assumptions and analyses, Figure S22 shows the dependence of parallelism and SNR on the photonic tensor core size, the optical bandwidth, and the intensity received at the photodetector. At N=20, both the coherent and partially coherent approaches reach 4-bit resolution. The parallelism of the coherent approach is 6, while the parallelism of the partially coherent approach is 12 and 25 with an OB of 8 nm and 4 nm, respectively. The partially coherent approach exhibits a greater parallelism advantage for large photonic tensor cores (N≥42). Within 42≤N≤62, the coherent approach's parallelism is 2, in contrast to the partially coherent approach's parallelism of 6, 12, and 25 with an OB of 16 nm, 8 nm, and 4 nm, respectively. For 63≤N≤125, the parallelism of the coherent approach diminishes to 1, and for N>125 it becomes inoperable.
However, the partially coherent approach consistently preserves its parallelism. The limitations of the partially coherent approach are related to the reduced SNR. At a high received intensity of 3.33 mW at the photodetector, the SNR of the coherent approach is higher than that of the partially coherent approach by 1-2 orders of magnitude, indicating superior computing accuracy. This SNR advantage diminishes to less than an order of magnitude at a moderate 0.3 mW and becomes comparable at a lower 0.024 mW. Notably, in large-scale photonic tensor cores, the intensity received at the photodetector is usually compromised by its distribution among numerous unit cells and by accumulated insertion losses, resulting in a typical received intensity spanning 0.1 µW to 0.1 mW. In partially coherent systems, it is crucial to note that the requisite optical delay lines, employed to reduce phase sensitivity, can lead to very long waveguides. This could result in a higher propagation loss and an additional footprint. To address this issue, we introduce an architectural design depicted in Figure S23, wherein all optical delay lines are coiled around the photonic tensor core, thereby achieving a high area efficiency.

Solutions to address the long delay line issue
While the footprint issue can be addressed by the design shown in Fig. S23, the long optical delay line issue still requires attention. Although ultralow-loss Si3N4 waveguides of 0.1 dB/m 8 and 1 dB/m 9 have been reported, they are not foundry-available, which limits their scalability. The foundry-available silicon nitride-on-silicon platform is a feasible solution to the long optical delay line issue 11 . This platform can harness the low-loss advantage of silicon nitride and the active device availability of silicon. If we assume a practical silicon nitride waveguide loss of 0.4 dB/cm 12 and further require the longest delay line to have a total loss of less than 3 dB, the longest delay line cannot exceed 3 dB / (0.4 dB/cm) = 7.5 cm = 75 mm. This 75-mm delay line limits the number of input channels to 59 if an optical bandwidth of 4 nm is needed (Fig. S24a and Fig. S24c). The presented numbers of channels and optical bandwidth are given as examples only. However, we note that this long delay line issue only exists if we assume the use of only one ASE source for the whole system. In practical implementations, an array of independent ASE sources working at the same wavelength can be employed, with each ASE source driving a few tens of input waveguide channels. These independent ASE sources are uncorrelated, eliminating the need for longer delay lines to overcome the coherence length of a single source. Waveguide-integrated ASE sources have been demonstrated by a few groups 2,13 . From a system point of view, their integration into a photonic tensor core is technically similar to laser/waveguide integration. The schematic of a system with multiple ASE sources is presented in Figure S25.
-End of supplementary text 6

| | MZI mesh | MRR weight bank | Coherent comb-based photonic tensor core | Partially coherent filtered ASE |
|---|---|---|---|---|
| Parallelism @ N=125 | 100/BW (a): 125 @ 0.8 nm BW | 1 (b) | 1 | 100/BW: 125 @ 0.8 nm BW; 50 @ 2.0 nm BW; 12 @ 8.0 nm BW |
| SNR @ 2 GSa/s and 0.1 mW intensity at photodetector | 51.0 | 51.0 | 51.0 | 8.9 @ 0.8 nm BW; 14.9 @ 2.0 nm BW; 28.6 @ 8.0 nm BW |
| SNR @ 2 GSa/s and 0.01 mW intensity at photodetector | 5.6 | 5.6 | 5.6 | 4.3 @ 0.8 nm BW; 5.0 @ 2.0 nm BW; 8.3 @ 8.0 nm BW |

(a) Dispersion is not considered. The phase map designated for one wavelength could be distorted at a different wavelength, and the cumulative phase error can be large.
(b) We assume that 125 wavelength channels are possible for an MRR weight bank, though it is expected that only 60 channels are available 17 .
Our partially coherent approach uniquely features phase insensitivity throughout the whole system, addressing the predominant challenge of phase fluctuation that hinders most large-scale photonic circuits and thus showing potential for scaling up. The amplitude modulation for weight encoding in our approach has a low temperature sensitivity of 0.08 dB/K 16 , which, if necessary, can be eliminated by a TEC controller. The adoption of partial coherence enables the distribution of light within the same optical window across the entire photonic processor while preserving multiplexing capability. Consequently, it achieves high parallelism, comparable to MZI meshes and superior to MRR weight banks and coherent comb-based photonic tensor cores. While the theoretical scalability of the MZI mesh is acknowledged, practical upscaling is impeded by accumulated random phase fluctuation and thermal crosstalk, which are difficult to mitigate. A notable limitation of the partially coherent approach lies in its lower SNR, intrinsically linked to the stochastic nature of the ASE light source. At an elevated optical power (> 1 mW received at the photodetector), coherent approaches provide markedly higher SNR than their partially coherent counterpart. However, at an intensity received by the photodetector ranging from 0.1 µW to 0.1 mW, which is the range of interest for many applications, the SNRs become comparable. Improved SNR could potentially be realized by replacing the EDFA ASE with broadband superluminescent diodes (SLEDs) 18 and further coupling the SLEDs with saturated semiconductor optical amplifiers (SOAs) to suppress noise 19 .

Figure S2. Impact of non-Gaussian-shaped spectrum. a, Spectrum of a more-Gaussian-shaped partially coherent source and a less-Gaussian-shaped partially coherent source. b, Comparison of degree of coherence as a function of length difference.

Figure S3. Measured noise as a function of signal (intensity received at the photodetector).

Figure S6. Weight setting using a phase-change material photonic memory with 4-bit resolution. The error bars represent the standard deviation in transmission levels when the memory is set to a specific level.

Figure S7. Two possible hardware implementations for negative weights. a, Balanced photodetection scheme. b, Photonic tensor core with an additional reference column. The subsequent subtractions are done in software.
-End of supplementary text 2

Figure S9. Convolution results obtained using CPU. The error bands represent the standard deviation of convolution results from 50 gait signals generated by the same patient, showing the variation of gait signals from this patient.

Figure S10. Convolution results of gait signals from Parkinson's disease patients 1-5. The error bands represent the standard deviation of convolution results from 50 gait signals generated by the same patient, showing the variation of gait signals from this patient.

Figure S11. Convolution results of gait signals from Parkinson's disease patients 6-10. The error bands represent the standard deviation of convolution results from 50 gait signals generated by the same patient, showing the variation of gait signals from this patient.

Figure S12. Confusion maps of CNN classification results for identification of Parkinson's disease patients using their gaits. a, Without convolution. b, Convolution performed by CPU. c, Convolution performed by partially coherent system. d, Convolution performed by coherent system.

Figure S13. Evolution of CNN loss and accuracy with increasing epochs in the identification of Parkinson's disease patients using their gaits.

Figure S14. Configuration and data flow of an FPGA-controlled photonic EAM tensor core operating at 2 GSa/s.

Figure S15. Confusion maps of CNN classification results of the MNIST handwritten digits dataset. a, Without convolution. b, Convolution performed by CPU. c, Convolution performed by partially coherent system without averaging. d, Convolution performed by partially coherent system with 4-point average.

Figure S16. Evolution of CNN loss and accuracy with increasing epochs in the classification of the MNIST handwritten digits dataset.
Figure S17a), and this noise is reduced by the 4-point average (Figure S17b,c). Quantitatively, however, the classification accuracy achieved on the MNIST fashion products dataset is lower than on the handwritten digits dataset. As shown in Figure S17d, using convolution results obtained by CPU, the classification accuracy is 82.8%. With the partially coherent system, the accuracy remains high at 80.2% with the 4-point average but declines to 74.6% without averaging. The associated confusion maps and the evolution of loss and accuracy with increasing epochs are presented in Figure S18 and Figure S19. This discrepancy in accuracy can be explained by the different noise resilience of the MNIST handwritten digits dataset and the MNIST fashion products dataset under the specific CNN configuration (Figure S20), and the accuracy could be improved by enhancing the SNR of partially coherent systems or adjusting the CNN configuration.
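The effect of a 4-point average on noise can be illustrated with a small simulation. This is a sketch only: it assumes independent, zero-mean Gaussian noise and averaging over non-overlapping groups of four consecutive samples, neither of which is specified in the text, and all names are hypothetical.

```python
import random
import statistics


def four_point_average(samples, points=4):
    """Average non-overlapping groups of `points` consecutive samples."""
    usable = len(samples) - len(samples) % points  # drop an incomplete last group
    return [
        sum(samples[i:i + points]) / points for i in range(0, usable, points)
    ]


def noise_reduction_ratio(num_samples=20000, sigma=1.0, seed=0):
    """Ratio of noise standard deviation after and before averaging.

    Assumes independent zero-mean Gaussian noise (an illustrative assumption).
    """
    rng = random.Random(seed)
    noisy = [rng.gauss(0.0, sigma) for _ in range(num_samples)]
    averaged = four_point_average(noisy)
    return statistics.pstdev(averaged) / statistics.pstdev(noisy)


if __name__ == "__main__":
    print("std ratio after 4-point averaging:", noise_reduction_ratio())
```

For independent noise, averaging four samples is expected to reduce the standard deviation by about a factor of two (the square root of four), at the cost of a fourfold lower effective sample rate.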

Figure S17. Convolution results and CNN classification accuracy of the MNIST fashion products dataset. a, An example of trouser edge detection. b, Convolution accuracy without averaging. c, Convolution accuracy with 4-point average. A total of 100,000 pairs of expected and measured results are compared in both b and c. The insets show the Gaussian distribution of normalized errors. d, Comparison of CNN classification results.

Figure S18. Confusion maps of CNN classification results of the MNIST fashion products dataset. a, Without convolution. b, Convolution performed by CPU. c, Convolution performed by partially coherent system without averaging. d, Convolution performed by partially coherent system with 4-point average.

Figure S19. Evolution of CNN loss and accuracy with increasing epochs in the classification of the MNIST fashion products dataset.

Figure S20. Impact of SNR on classification accuracy. a, MNIST handwritten digits dataset. b, MNIST fashion products dataset.
-End of supplementary text 5

Figure S22. Dependence of parallelism and SNR on the size of photonic tensor core, optical bandwidth, and intensity received at the photodetector.

Figure S23. Proposed architecture design for area-efficient optical delays. Consider an N×N photonic tensor core, with a ΔL_Ch spacing between adjacent input

Figure S24. Dependence of the longest delay line length, percentage area increase, and the minimum feasible optical bandwidth on the tensor core size and adjacent channel spacing.

Figure S25. Schematic of a system with multiple ASE sources. The total number of input channels in this schematic is 59×N, where N is the number of ASE sources. Delay lines are sketched in spirals instead of wrapping around the processor for visual clarity. All ASE sources operate at the same optical band. The number of input channels driven by each ASE source is 59 for consistency with the example given above. This number can be adjusted depending on the waveguide loss, adjacent channel distance, and required optical bandwidth.